s3parquet destination + OOM tuning (final stack layer) — fixes 3 pre-existing test failures by randomizedcoder · Pull Request #29 · randomizedcoder/xtcp2

randomizedcoder · 2026-06-15T03:35:18Z

Summary

s3parquet-destination — the final stack layer (32 commits). Adds a direct Parquet → MinIO/S3 destination (retiring the Vector sidecar), plus clickhouse/kafka/MinIO OOM tuning, a fail-early capability check, Pyroscope continuous profiling, and the long-soak/parquet microVM flavors. After this merges, main contains the entire advanced stack.

Conflicts resolved

cmd/xtcp2/xtcp2.go / xtcp2_test.go / nix/microvms/mkVm.nix: pure gofmt/alignment divergence (main vs the stack base had identical fields, different whitespace) + the s3parquet additions — took the s3parquet version, then re-formatted.

Pre-existing test failures found in the branch tip — fixed here

The s3parquet layer shipped with 3 failing test packages (confirmed at the pristine s3parquet-destination tip via a clean worktree — the stack's CI only ran them inside the capability-granting microVM):

cmd/xtcp2 TestPrintFlags / TestBuildConfig — nil-deref: the fixtures never allocated the 4 new pyroscope* mainFlags pointers that printFlags/buildConfig dereference. → allocate them.
pkg/xtcp / cmd/ns — the new fail-early capability check (Init → checkCapabilities → x.fatalf) os.Exit'd the test binary on sandboxes lacking CAP_SYS_ADMIN/CAP_NET_ADMIN (both NewXTCP and NewNsTestingXTCP call Init). → add a SetCapabilityCheck test seam (matching the existing constructorRegistry/netNsCandidateDirs seams) + a TestMain in each package that installs a no-op. Production fail-fast is unchanged; the cap logic is still tested directly in init_capabilities_test.go.
TestS3ParquetDest_corner_queueFull — flaky: a 2 s safety deadline tripped under full-suite parallel load. → widen to 30 s (happy path unaffected).
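
The test seam described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `capabilityCheck` variable, the shape of `Init`, and the always-failing default are assumptions chosen so the sketch runs without privileges (the real Init is described as fatal-exiting rather than returning an error).

```go
package main

import (
	"errors"
	"fmt"
)

// capabilityCheck is the seam. The default here is a stand-in that always
// fails, imitating a sandbox without CAP_SYS_ADMIN/CAP_NET_ADMIN.
var capabilityCheck = func() error {
	return errors.New("missing CAP_SYS_ADMIN")
}

// SetCapabilityCheck installs fn and returns a function that restores the
// previously installed check.
func SetCapabilityCheck(fn func() error) (restore func()) {
	prev := capabilityCheck
	capabilityCheck = fn
	return func() { capabilityCheck = prev }
}

// Init returns the error instead of exiting so the sketch stays testable.
func Init() error {
	if err := capabilityCheck(); err != nil {
		return fmt.Errorf("init: %w", err)
	}
	return nil
}

func main() {
	fmt.Println("default check:", Init())
	restore := SetCapabilityCheck(func() error { return nil })
	fmt.Println("no-op check:", Init())
	restore()
	fmt.Println("restored check:", Init())
}
```

A package-level TestMain would call `SetCapabilityCheck` with the no-op once before `m.Run()`, which keeps production fail-fast behaviour untouched.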

Testing

Binary-blob guard: clean. go vet ./... + gofmt -l . + repo nix-fmt check: clean.
go test -ldflags=-checklinkname=0 -tags 'dest_kafka dest_nats dest_nsq dest_valkey dest_s3parquet' ./... — entire suite green across repeated runs (incl. the now-robust queueFull test).
Not booted here (KVM/docker-in-VM heavy): the s3parquet/clickhouse-pipeline and capcheck-fail microVM flavors — those exercise the capability check + the MinIO/parquet round-trip end-to-end and remain microVM-verified.

🤖 Generated with Claude Code

New `-dest s3parquet:<endpoint>` destination accumulates ProtobufList Envelopes into in-memory Parquet builders, finalizes at a configurable byte threshold (default 63 MiB), and PUTs the object via minio-go to S3-compatible storage. Hive-partitioned object keys (host=…/date=…/hour=…) keep `s3()` table-function consumers and DuckDB pruning cheap.
Architecture:
- Async worker owns the parquet-go GenericWriter and minio client; Send only marshals the bytes into a bounded queue (16 slots) and returns, so the Poller never stalls on uploads. Queue-full bumps a Prom counter and falls through to a blocking send (back-pressure visible without data loss).
- Hand-written ParquetRow struct mirrors XtcpFlatRecord with parquet: snake_case + per-column codecs (ZSTD for strings/bytes, SNAPPY for numeric/timestamp). A drift test reflects the proto FileDescriptorSet at unit-test time and fails CI if columns diverge.
- Object keys sanitized for path traversal / NUL / control chars before they touch the S3 PutObject call; a hacker-attacker test suite asserts ../../etc/passwd-style hostnames cannot escape the prefix and S3 secret keys never appear in error paths.
CLI / config: -s3Endpoint, -s3Bucket, -s3Prefix, -s3AccessKey, -s3SecretKey, -s3Region, -s3ParquetFlushBytes (+ S3_* env overrides; secrets logged as "set" only). Proto fields s3_endpoint=125 … s3_region=133 (130/131 skipped to avoid the existing `dest` slot).
Vector retired in the same commit:
- vector-pipeline.nix, xtcp2-vector-path.nix, self-test-vector.nix deleted; vector branches in mkVm.nix / microvms/default.nix / nix/default.nix removed (isVector, vectorModules, xtcp2VectorArgs, vmsVector, lifecycleVector, checksVector, microvm-x86_64-vector, microvm-x86_64-lifecycle-vector). Vector was misconfigured for the ProtobufList envelope wire format and wrote JSON, not Parquet; s3parquet supersedes its intended role with one fewer process and no descriptor-set mount.
- mkProtoDescSet helper and the `xtcp-flat-record-desc` package remain exposed for external consumers that still want the .desc artifact.
Microvm flavor `s3parquet` (sink="s3parquet") reuses the existing minio-bucket-bootstrap module. Lifecycle self-test adds two sentinels: S3PARQUET_FILES_PASS (≥1 .parquet object in MinIO within 90 s) and S3PARQUET_ROWS_PASS (DuckDB row count ≥1 from the produced object). Both pass in CI with 1204 rows landed after a 60 s boot.
Test coverage in all six categories (positive/negative/boundary/corner/adversarial/hacker-attacker) plus Benchmarks and a concurrent sends+close race test under `-race`.
Validator change: schemeS3Parquet joins the path-style exemption in input_validation.go since the endpoint URL (http://host:port) has its own colons; the strict x2-colon rule still applies to kafka/nats/nsq/valkey/udp.
Vendor hash + allLibraryDestinations updated for the new minio-go + parquet-go deps.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…te heartbeats Adds a new microvm flavor `s3parquet-long` paired with `mkS3ParquetRunner` (`nix run .#microvm-x86_64-s3parquet-runner -- --duration <5m|12h|…>`). Mirrors the existing soak/tcp-stress runner pattern: boots the VM, sleeps for the requested duration, prints a heartbeat each 30 s (short runs) / 5 min (long runs), then powers off with a markdown-style summary table of per-sentinel file deltas. Flavor mechanics: - `sink = "s3parquet-long"` reuses the existing minio-bucket-bootstrap module, the s3parquet destination, and the soak nsTest/tcp_server/ tcp_client traffic generators (so xtcp2 always has a populated netlink readout to feed the parquet writer). 1 MiB flush threshold keeps the file count visible at short durations; edit xtcp2S3ParquetLongArgs to 67108864 (or omit the flag) for production 63 MiB testing. - Self-test is skipped (`!isSoak && !isS3ParquetLong`); a new systemd unit `xtcp2-s3parquet-monitor.service` emits one sentinel line per `S3PARQUET_REPORT_INTERVAL` seconds (default 60 s): XTCP2_S3PARQUET_HOURLY <ts> files=<n> bytes=<n> rows=<n> - The monitor sources its numbers from xtcp2's own Prometheus counters (`destS3Parquet/upload`, `uploadBytes`, `uploadRows`) via `curl /metrics`. An earlier `mc find` implementation was too slow under nsTest load — Prometheus is authoritative and ~1 ms per scrape. Runner mechanics: - Reads heartbeat counts off the in-VM sentinels in the serial transcript (host-side mc through the forwarded port doesn't actually route in this microvm setup — qemu reports the port as LISTEN but curl times out). - `--report-interval` is honored only as a sanity check in the summary's min-expected-reports math; the in-VM cadence is baked at build time. - `--rss-cap-mb` parameter wired but inactive (RSS scrape from the host requires VM introspection we don't have); kept as a hook for a follow-up. - Summary: total files, total bytes, total rows, panics, restarts, and the full per-sentinel delta table. 
Bucket-bootstrap module now binds MinIO to 0.0.0.0 instead of 127.0.0.1 so the (currently disabled) host-side forwarded-port path would work if microvm.nix's hostfwd routing ever gets fixed. Inside the VM nothing changes — xtcp2 still talks to MinIO via 127.0.0.1. Phase B (5 m): 52 files PASS. Phase C (30 m): 366 files PASS, steady ~12-14 files/min delta, zero panics/restarts, in-VM memory stable. Phase D (2 h at production 63 MiB) and Phase E (12 h, production- shaped) remain user-triggered: nix run .#microvm-x86_64-s3parquet-runner -- --duration 12h The defaults give ~12 files/min at 1 MiB threshold; switch the threshold in xtcp2S3ParquetLongArgs for production-size objects. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Before launching the 12 h soak. At the steady ~1 MB/min raw-row rate observed in the 30 min smoke, a 12 h run produces ~12 finalized objects — matches the user's "multiple files after 12 hours" expectation and exercises the production-sized object path the 1 MiB smoke threshold doesn't. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds a Pyroscope-go agent inside xtcp2 and an in-VM Pyroscope OSS server so operators can stream and visualise CPU, alloc, in-use, goroutine, mutex, and block profiles without a separate profiling infrastructure. Motivated by the 12 h s3parquet-long soak hitting `fatal error: thread exhaustion` at 1h 45min — a goroutine/thread leak in the namespace-handler hot path that pprof one-shots couldn't localize. xtcp2 (Go): - New deps: github.com/grafana/pyroscope-go + godeltaprof - New CLI flags + proto fields (136-139): -pyroscopeUrl (empty disables the agent) -pyroscopeAppName ("xtcp2" by default) -pyroscopeSampleHz (100 Hz default) -pyroscopeUploadSec (15 s default) All five profile types start when -pyroscopeUrl is non-empty. Empty URL is zero-overhead — production runs that don't want the agent simply leave the flag unset. Secrets aren't applicable (Pyroscope endpoints are usually authenticated by network policy or a sidecar; we don't ship credentials in argv). NixOS module nix/modules/pyroscope-server.nix: - Wraps services.pyroscope to run a single-binary all-in-one server with filesystem-backed storage (no external S3/Azure blob dependency). Listens on 0.0.0.0:14040 in the VM (4040 is occupied by something else inside the NixOS boot; investigated briefly then sidestepped — 14040 works cleanly). - Drops DynamicUser → runs as root inside the disposable VM so writes to /var/lib/pyroscope/blocks succeed without the nixpkgs default's StateDirectory-vs-tmpfs choreography. - Forces stderr/stdout onto journal+console so future startup failures surface on the serial transcript (the default journal-only logging hid the real "bind: address already in use" diagnostic across three earlier debugging cycles). Microvm wiring: - s3parquet-long flavor imports pyroscope-server.nix and passes -pyroscopeUrl http://127.0.0.1:14040 -pyroscopeAppName xtcp2.s3parquet-long into xtcp2's extra args. 
- Forwards host:14040 → guest:14040 so an operator can hit the Pyroscope UI at http://127.0.0.1:14040 if QEMU hostfwd is working (it intermittently isn't in this microvm setup, but the agent still streams profile data inside the VM regardless). - In-VM monitor now also emits go_goroutines + go_threads in the XTCP2_S3PARQUET_HOURLY sentinel — a per-minute leak indicator visible directly in the runner summary without needing the Pyroscope UI. Phase G validation: 30 min s3parquet-long soak PASS, 6 finalised 63 MiB parquet objects, 0 panics, 0 restarts, Pyroscope agent shipping all five profile types every 15 s. Ready for the follow-up 2+ hour leak-diagnosis run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The 12 h s3parquet-long soak hit `fatal error: thread exhaustion` at 1 h 45 min, with over 2000 OS threads accumulated. Pyroscope's live goroutine profile (now reachable from the host via the firewall fix in the same change) showed the leaking call site clearly:
    50 goroutines @ ns_net_namespace.go:141 (<-nsCtx.Done())
    33 goroutines @ ns_net_namespace.go:281 (Setns backoff)
each holding runtime.LockOSThread().
The deferred restore-netns Setns kept failing with EPERM under nsTest churn at 250 ms cadence. The previous code accepted this: it counted the error, kept the goroutine going, then UnlockOSThread'd the *tainted* M (now in a deleted netns) back to Go's scheduler. The runtime tried to reuse it, hit the wrong-netns mismatch on the next syscall, and was forced to spin up a fresh M every time, growing the M-pool past the SetMaxThreads(2000) ceiling.
Fix: make UnlockOSThread conditional on the restore Setns succeeding. On EPERM we skip the unlock; the goroutine exits while still holding the lock, and the Go runtime terminates the OS thread (documented runtime.LockOSThread behaviour) instead of recycling a tainted M.
Cost: one OS-thread creation per failed restore (~10 µs). At 4 namespace events/sec for 1 h that's ~14 k thread creations totalling ~140 ms of overhead. Versus the prior unbounded accumulation leading to a crash, the trade is obvious.
Other observability landings in this commit that supported the diagnosis:
- nix/microvms/mkVm.nix: open the s3parquet/MinIO/Pyroscope ports in networking.firewall.allowedTCPPorts so QEMU usermode hostfwd packets actually reach the listeners (the previous firewall block only enumerated tcp-stress + clickpipe). curl/browser from the host can now hit pyroscope :14040, MinIO :9000/:9001, and xtcp2 /metrics + /debug/pprof on :9088.
- cmd/xtcp2: register net/http/pprof side-effect import so /debug/pprof/{goroutine,heap,…} is available on the prom port without standing up a separate debug server. Used to capture the goroutine stack distribution that pointed at the leak.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous commit fixed the M-recycling symptom; this commit fixes the root cause. Pyroscope diagnostics from the validation run exposed the actual signal: restoreNs error: 12116 (100% of 12,116 attempts failed) restoreNs count: 0 Every single setns(CLONE_NEWNET) restore was failing with EPERM. Decoding xtcp2's init-time capability dump (Effective = 0x1003000) confirmed why: the service only had CAP_NET_ADMIN + CAP_NET_RAW + CAP_SYS_RESOURCE. setns(CLONE_NEWNET) requires CAP_SYS_ADMIN in the target netns's userns; without it, both the initial setns into a new ns AND the restore back to the original ns fail. The retry loop in openAndSetNSWithRetries spun all 10 attempts under EPERM holding a LockOSThread'd OS thread; the previous unconditional defer UnlockOSThread (now conditional) handed the tainted M back to the scheduler; thread count grew without bound; SetMaxThreads(2000) ceiling crashed the daemon at 1h 45min under nsTest's 4-evts/sec churn. clickhouse-pipeline runs survive 12+ h because clickpipe doesn't run nsTest churn — its namespace surface is whatever docker creates (handful of containers, minutes between events). Soak + s3parquet-long both run nsTest at 250 ms cadence and hit the wall. Granting CAP_SYS_ADMIN means setns succeeds on the first attempt, restore succeeds, M is properly recycled by the runtime, thread count stays bounded by the active-namespace working set (~50-300 in steady state, not unbounded growth). The conditional UnlockOSThread from the prior commit remains as defense-in-depth for any future environment where CAP_SYS_ADMIN is dropped or scoped differently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…Setns pattern Two layers of defense against re-introducing the OS-thread leak that crashed the 12 h s3parquet-long soak: 1. **Regression test** `pkg/xtcp/ns_thread_leak_test.go`: - Uses the existing test seam (now extended with a restoreNsSetns hook) to force the restore-Setns to return EPERM, mirroring the production microvm scenario. - Runs N=400 iterations of the LockOSThread + restore-fails + exit pattern with `debug.SetMaxThreads(150)` so any leak panics immediately instead of looking slow. - Asserts /proc/self/status:Threads delta stays ≤ 80 across the run. Without the fix the test would either panic on the thread cap or fail the delta bound. With the fix delta=1 in practice. 2. **Forbidigo linter rule** in .golangci.yml: - Bans bare `runtime.UnlockOSThread`. Callers must opt in with `//nolint:forbidigo // <reason>` documenting why the unlock is safe in that context. Forces the next person who writes a LockOSThread/Setns pairing to confront the bug class at the line they're writing. - The rule immediately caught a SECOND occurrence in `pkg/xtcp/ns_watch.go::createNetworkNamespace` — same bug, same fix (conditional unlock inside the restore defer). - All legitimate uses (io_uring SQ-thread pinning, CPU-pin in bench tests) annotated with nolint + justification. Together: the linter catches the static pattern at write time; the regression test catches the runtime behaviour if someone bypasses the linter. Either alone would be incomplete; together they cover both the "removed conditional" and "added unconditional" regression shapes. Includes: - Restore-Setns seam (`restoreNsSetns` var) in ns_net_namespace.go so tests can force the restore-failure code path without needing real CAP_SYS_ADMIN or live namespaces. - gofmt + goimports drift fixes in cmd/xtcp2 / xtcp2_test.go / udp_receiver_server_test.go that surfaced when the lint became stricter. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
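
The thread-delta assertion in the regression test relies on reading the `Threads:` field of /proc/self/status. A minimal sketch of that measurement, assuming Linux and using hypothetical helper names (`parseThreads`, `withinBound`), could look like this:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseThreads extracts the "Threads:" value from the text of
// /proc/<pid>/status.
func parseThreads(status string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		rest, ok := strings.CutPrefix(sc.Text(), "Threads:")
		if !ok {
			continue
		}
		return strconv.Atoi(strings.TrimSpace(rest))
	}
	return 0, fmt.Errorf("no Threads: line found")
}

// withinBound reports whether the thread count grew by at most max.
func withinBound(before, after, max int) bool { return after-before <= max }

func main() {
	raw, err := os.ReadFile("/proc/self/status") // Linux only
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	n, err := parseThreads(string(raw))
	fmt.Println("threads:", n, "err:", err)
}
```

A test would sample the count before and after the N iterations and fail when `withinBound(before, after, 80)` is false.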

…k-fail microvm flavor
When the 12 h soak crashed with `fatal error: thread exhaustion`, the missing CAP_SYS_ADMIN was the proximate cause, but the user had to bisect across the runtime to find it. This commit makes that class of misconfiguration loud at startup instead of hours later under stress.
Go side (pkg/xtcp/init_capabilities.go):
- Replace the legacy CAP_NET_ADMIN + CAP_SYS_CHROOT check (the chroot one was never actually used) with a structured requiredCaps table:
    CAP_NET_ADMIN      fatal   no netlink inet_diag without it
    CAP_SYS_ADMIN      fatal   setns(CLONE_NEWNET) needs it
    CAP_NET_RAW        warn    raw-socket destinations fail
    CAP_SYS_RESOURCE   warn    io_uring rings get bounded
- Print a per-cap diagnostic at startup. On any fatal-tier capability missing, exit cleanly via x.fatalf() with a multi-line message that names each missing cap, explains the failure mode, AND emits a ready-to-paste systemd snippet so the operator can fix the config in one copy. Soft-required caps surface as warnings; daemon continues.
- pkg/xtcp/init.go: promote checkCapabilities from log-only to fatal-exit. Hard-required missing caps refuse to start the daemon rather than letting it limp and crash later.
Tests (pkg/xtcp/init_capabilities_test.go):
- Rewrite over the new requiredCaps table.
- New cases:
    hasAllRequired, hasEverything   happy paths
    missingNetAdmin                 fatal diagnostic
    missingSysAdmin                 fatal diagnostic (the original 12 h soak bug)
    missingOnlySoftCaps             warnings + nil err
    missingBothHardCaps             both named in err
    capgetErr                       error wrapping
- Each fatal-path assertion pins on the expected substring (capability name + remediation hint) so a regression in the message would surface in CI.
Microvm wiring:
- nix/modules/xtcp2-service.nix gains a `capabilities` option (defaults to the full set). The systemd unit uses it for both AmbientCapabilities and CapabilityBoundingSet so test flavors can drop one to validate the fail-early path.
- mkVm.nix adds sink="capcheck-fail": same s3parquet-long config, but `services.xtcp2.capabilities` deliberately omits CAP_SYS_ADMIN. xtcp2.service then refuses to start; systemd prints the diagnostic to the serial console on each restart attempt.
- Exposed as flake package microvm-x86_64-capcheck-fail.
Verified end-to-end: booting microvm-x86_64-capcheck-fail shows the expected diagnostic on the serial transcript, and xtcp2.service enters a Restart=on-failure loop instead of the silent thread-leak behaviour it had before.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

User feedback: the microvm-x86_64-capcheck-fail flavor is overkill for what we're verifying — does the daemon exit with a clear per-capability diagnostic when CAP_SYS_ADMIN is missing? A normal `nix runCommand` derivation can spawn xtcp2 in the build sandbox (which runs as an unprivileged user with no elevated caps) and assert the same end-to-end behaviour in under a second. New checks (auto-added to `nix flake check`): - capability-check-no-caps names CAP_NET_ADMIN - capability-check-names-sys-admin names CAP_SYS_ADMIN Both spawn xtcp2 with -dest null -maxLoops 1 in the Nix sandbox. Without any privileged caps the startup checkCapabilities path fires, the daemon fatal-exits, and the test asserts the stderr contains the missing-cap name plus the systemd remediation snippet ("AmbientCapabilities", "CapabilityBoundingSet"). The pinned substrings would surface any future weakening of the diagnostic in CI. The microvm-x86_64-capcheck-fail flavor stays for full-stack validation (systemd ambient-cap config → xtcp2 → restart loop) but is no longer the routine check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… pipeline
The prior 12 h validation proved the OS-thread leak is fixed (drift 277→317 over 11 h, vs the previous unbounded growth that crashed at 1 h 45 min). But the run reported "FAIL: no parquet files landed" because most nsTest-churned namespaces are socket-empty, so xtcp2's per-namespace netlink poll returned nothing and the parquet writer had nothing to batch. The leak got fixed but the workload never stressed the codepath that broke.
Three knobs to put genuine pressure on the same path:
1. soakInitialNs: 50 → 200 (4× concurrent namespace working set)
2. soakChurnSleep: 250 ms → 100 ms (2.5× ns event rate)
3. new xtcp2-soak-ns-traffic systemd unit (the big one)
(3) is a small shell driver that continuously scans /run/netns/ and, for every nsX it finds, fires `ip netns exec <ns>` with a brief loopback ncat listener+connector pair INSIDE the namespace. The pair lives ~50 ms before the listener exits, long enough for xtcp2's next per-namespace netlink poll to catch the ESTABLISHED state, plus the subsequent TIME_WAIT. A concurrency cap of 30 in-flight injectors bounds host fork pressure even with soakInitialNs=200.
Net effect on the workload (vs prior run):
- ns event rate: 4 evts/sec → 10+ evts/sec
- in-flight namespaces: ~50 → ~200
- envelopeRows/12h: ~73 → expected many thousands
- finalized parquet files/12h: 0 → expected ≥10
If the leak fix still holds under this load, and the parquet pipeline survives sustained envelope production for 12 h, the bug class is genuinely closed. If anything ELSE breaks (file descriptor limits, parquet builder memory, MinIO upload backpressure), we catch it here instead of in a customer deployment.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

First aggressive 12 h soak attempt: the unit was SKIPped at boot with "Ordering cycle found, skipping xtcp2 soak — in-namespace TCP loopback injector". My `after = [xtcp2-soak-churn.service ...]` formed a cycle with the implicit multi-user.target dep chain. The driver script already handles `/run/netns/` being empty (sleeps 0.5 s and re-checks), so the dep was decorative — drop it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The aggressive 12 h soak ran flat files=0 despite the new soak knobs (200 ns, 100 ms churn, ns-traffic systemd service). Root cause: the previous shell-based injector lost the race between `ls /run/netns/` and `ip netns exec` — the ns was gone by the time exec ran ("Cannot open network namespace nsXX"). Bumping concurrency didn't help; the script's own bash interpreter wasn't even in PATH ("exec of bash failed"). Cleaner fix: have nsTest itself open the loopback connection immediately after `ip netns add`, in-process. No race possible (we hold the ns reference) and no PATH issues. Implementation: - New -traffic flag (default false). - After each `ip netns add`, on a LockOSThread'd goroutine: 1. snapshot origNs 2. setns into the new ns 3. `ip link set lo up` (shell — one-shot at ns-creation time, latency immaterial at ~10 creates/sec) 4. open net.Listen on 127.0.0.1:0 + net.Dial back to it + exchange one payload + close 5. setns back to origNs; conditional UnlockOSThread (same pattern as the netNamespaceInstance fix — on Setns restore failure leave the lock held so the runtime terminates the OS thread instead of recycling a tainted M) - Each TCP exchange leaves a TIME_WAIT pair in the ns's kernel socket table for ~60 s; the ns lives ~20 s under the soak's 100 ms churn cadence so xtcp2 sees socket state on every poll. Wiring: - soakChurnScript now passes -traffic to nsTest. - The old shell-based xtcp2-soak-ns-traffic systemd unit is left guarded behind `lib.mkIf false` — not deleted yet so future reference debugging can compare approaches. Sanity: 5 min smoke with -traffic produced Netlinker 2 packets:8, n:192, p:2, fd:... ns:/run/netns/ns86 vs the prior empty Netlinker N packets:Y, n:20, p:0, ... Files still 0 at 5 min because the 63 MiB flush threshold needs more accumulated envelope bytes — addressed by the upcoming 12 h soak which has the runtime to hit it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
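
Step 4 of the -traffic implementation (listen on 127.0.0.1:0, dial back, exchange one payload, close) can be sketched without the namespace handling. The function name and the 2 s dial timeout are assumptions; the surrounding setns and LockOSThread steps are deliberately left out so the sketch runs unprivileged.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// loopbackExchange opens a listener on an ephemeral loopback port, dials
// it, sends payload once, and returns what the accepting side read.
func loopbackExchange(payload []byte) ([]byte, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer ln.Close()

	type result struct {
		data []byte
		err  error
	}
	got := make(chan result, 1)
	go func() {
		c, err := ln.Accept()
		if err != nil {
			got <- result{nil, err}
			return
		}
		defer c.Close()
		data, err := io.ReadAll(c) // returns once the dialer closes
		got <- result{data, err}
	}()

	conn, err := net.DialTimeout("tcp", ln.Addr().String(), 2*time.Second)
	if err != nil {
		return nil, err
	}
	if _, err := conn.Write(payload); err != nil {
		conn.Close()
		return nil, err
	}
	conn.Close() // active close: this side will linger in TIME_WAIT
	r := <-got
	return r.data, r.err
}

func main() {
	out, err := loopbackExchange([]byte("ping"))
	fmt.Printf("%q %v\n", out, err)
}
```

In nsTest, per the commit message, this exchange runs between the setns into the new namespace and the setns back, so the resulting sockets belong to that namespace.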

…d io profiles
The prior -traffic mode opened one brief loopback exchange per ns (leaving a TIME_WAIT pair visible to inet_diag for ~60 s). That worked but gave xtcp2 only ~2 sockets per ns with identical TCP_INFO. -conns 100 instead opens 100 listener+dialer pairs per ns and keeps them alive for the ns's lifetime; each conn picks a profile from the cross product of 5 payload sizes (16 B / 256 B / 4 KB / 16 KB / 64 KB) × 4 send intervals (1 / 10 / 100 / 500 ms), so the TCPInfo spread across the 200 sockets per ns is real.
Lifecycle:
- startPersistentTraffic spawns a setup goroutine on a LockOSThread'd thread: setns into the new ns, `ip link set lo up`, open all N listener+dialer pairs, setns back, conditional UnlockOSThread (same pattern as netNamespaceInstance: on Setns restore failure keep the lock held so the runtime terminates the OS thread).
- Once the sockets are open the io goroutines do NOT need to be on the LockOSThread'd thread; the sockets carry their netns identity. So 2N io workers per ns × 200 ns = 40k goroutines, but only ~200 OS threads tied to ns work.
- stopPersistentTraffic is called immediately before `ip netns del` in the churn loop: cancels the ns ctx, closes all sockets, bounded 2 s drain wait. Clean shutdown means no EBADF/EPIPE noise in the journal during normal churn.
- Per-ns state lives in a sync.Map keyed by ns name.
Wiring:
- soakConnsPerNs = 100 added to mkVm.nix.
- soakChurnScript invokes nsTest -conns ${soakConnsPerNs} (replaces the -traffic flag for the long-soak flavor; -traffic itself is kept for backward compat with shorter smoke flows that only need the one-shot TIME_WAIT injection).
5 min smoke under init-burst saw ~20k near-simultaneous connect() calls overwhelm the loopback path (dial timeouts on ~30% of init-fill conns). 2 s dial timeout + silent skip-on-fail handles the noise; by t=5 min the system is stable and producing files.
1 h test PASS: 10 parquet files / 52.9 MB, 0 panics, 0 restarts, threads stable at 1034.
Roughly 7× the per-hour parquet throughput of the previous 12 h soak (which only managed 17 files in 12 h, about 1.4 files/h, with the brief-injection -traffic mode). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… side Mixed flavor that runs the existing clickpipe stack (redpanda + clickhouse + grafana + prometheus, docker) PLUS in-VM MinIO + a second xtcp2 instance writing parquet directly to MinIO. Validates the "operator wants both wire formats out of one host" deployment shape and exercises ClickHouse's s3() table function against the parquet objects xtcp2 produces. What's wired: - sink = "clickhouse-pipeline-parquet" → mkOneClickPipeParquet. - isAnyClickPipe / isAnyS3Parquet convenience predicates so shared infra (docker volume, port forwards, firewall, prom/grafana, clickpipe-up unit, MinIO bucket bootstrap) lights up for both flavors via one gate change each instead of N matches. - New `systemd.services.xtcp2-parquet` unit, scoped to isClickPipeParquet: runs `${xtcp2Package}/bin/xtcp2` with xtcp2ClickPipeParquetArgs: -dest s3parquet:http://127.0.0.1:9000 -s3Bucket xtcp2-records -s3ParquetFlushBytes 4194304 (4 MiB; gives turnover within a 30 min smoke run) -promListen :9089 -grpcPort 8890 (off the primary's :9088/:8889) Same caps as the primary xtcp2 (CAP_NET_ADMIN + CAP_NET_RAW + CAP_SYS_RESOURCE + CAP_SYS_ADMIN). - Primary xtcp2 (kafka path, xtcp2ClickPipeArgs) runs unchanged. ClickHouse container gets `--add-host host.docker.internal:host-gateway` on its docker run so the s3() function can reach the in-VM MinIO at http://host.docker.internal:9000 from inside the bridge network. The mapping is a no-op for plain clickpipe runs that don't use s3(). self-test.nix gains a new optional `runClickhouseParquetCheck` param: - Check 15: `SELECT count() FROM s3('http://host.docker.internal:9000/ xtcp2-records/**/*.parquet', '…', '…', 'Parquet')` via the clickhouse container. Polls up to 90s for the first parquet object to land (4 MiB threshold). - Emits XTCP2_SELF_TEST_CLICKHOUSE_PARQUET_{PASS,FAIL}. 
Exposed at the flake level as: - packages.microvm-x86_64-clickhouse-pipeline-parquet - apps.microvm-x86_64-clickhouse-pipeline-parquet (boots the VM directly, same pattern as the plain clickhouse-pipeline app). Next: short hand-driven boot to verify both xtcp2 instances start cleanly and ClickHouse can resolve host.docker.internal, then wire into a proper lifecycle test. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ower flush for smoke Two follow-ups after the first boot of the clickhouse-pipeline-parquet flavor: 1. The second xtcp2 instance was bound on :9089 / :8890 inside the VM but the host couldn't reach it because the QEMU hostfwd table only listed :9088 / :8889 (the primary instance's ports). Added matching forwardPorts entries + firewall openings under `lib.optionals isClickPipeParquet`. Operators can now hit http://127.0.0.1:9089/metrics for the parquet pipeline's prom counters side-by-side with :9088 for the kafka pipeline. 2. Dropped xtcp2ClickPipeParquetArgs's -s3ParquetFlushBytes from 4 MiB to 256 KiB. The mixed flavor exists primarily to validate the kafka + parquet + ClickHouse-reading-parquet plumbing in a short smoke; 256 KiB flushes within ~30 s of boot and gives the self-test check immediate signal. Production deployments using the same pattern should set this to the 63 MiB default by editing the flavor. End-to-end verified: ClickHouse's s3() table function reading from host.docker.internal:9000 (the in-VM MinIO via the bridge gateway alias added in the previous commit) now returns row counts from the xtcp2-written parquet objects. 600 rows in one parquet file at +90 s, alongside 72 rows in the kafka path (still ramping up the clickhouse kafka-engine consumer). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…+ 8 GB ClickHouse First 2 h mixed-flavor soak hit ClickHouse's container-level memory cap (3500m default from the clickpipe flavor) — 222 MEMORY_LIMIT_EXCEEDED errors over the run, blocking the kafka_engine MV. The parquet pipeline was unaffected (it writes through MinIO, not through ClickHouse) but the goal of the mixed flavor is to validate BOTH paths in one VM, so the kafka path needs room to operate. Two coupled changes: - constants.nix: new `memClickPipeParquet = 12288` (vs 6144 for plain clickpipe). Headroom for: ClickHouse (~5 GiB peak under the mixed load), Redpanda (~700 MiB), MinIO (~300 MiB growing), 2× xtcp2 instances (~500 MiB each), dockerd, page cache, kernel. - mkVm.nix: new `clickPipeClickhouseMemory` let-binding picks the container --memory= based on isClickPipeParquet — 8000m for the mixed flavor, 3500m for plain (unchanged, keeps the 12 h-validated budget). Wired into the docker run. The 12 GiB VM is non-trivial; the plain clickhouse-pipeline flavor keeps its 6 GiB budget so existing soak runs aren't perturbed. Only the mixed flavor takes the larger footprint, and it's the same order as a typical operator running clickpipe + parquet on one box. Next: re-run the 2h mixed soak with the bumped budgets and confirm kafka_engine MV catches up to xtcp2's produce rate alongside the parquet pipeline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…me OOMs Root cause of the persistent MEMORY_LIMIT_EXCEEDED storm in the mixed flavor was NOT MV / parts-merge memory pressure but ClickHouse's OWN observability tables. Under the mixed workload (2× xtcp2 + nsTest churn + kafka_engine + s3 reads) the periodic flushes into system.latency_log, metric_log, asynchronous_metric_log, processors_profile_log accumulate fast — then their background merges trip the per-server max-memory cap before the user kafka MV gets a chance. Bumping memory just raised the cap; the workload kept up. With config.d/disable_chatty_logs.xml mounted into the container, MEMORY_LIMIT_EXCEEDED dropped from 903 to ~28 over the same 15 min smoke window and xtcp.xtcp_flat_records started ingesting again (parquet path was always fine — s3() roundtrip returns ~22 k rows). Keep memClickPipeParquet=16384 / --memory=12000m as cheap insurance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ClickHouse's kafka_engine defaults `kafka_max_block_size` and `kafka_poll_max_batch_size` to `max_block_size` (65,505). With our ProtobufList wire format, each kafka message is an `Envelope` that expands into ~100–1000 `XtcpFlatRecord` rows, so a single poll cycle wants to materialize 6.5M–65M rows in memory before flushing. That's what `StorageKafka::threadFunc` was OOMing on in the mixed clickhouse-pipeline-parquet flavor (~2500 sockets fattening envelopes). After the chatty-logs disable last commit, the remaining OOMs were all on this path.

Capping to 256 messages/poll bounds the working set at ~256 × avg-envelope-size rows; the MV still flushes 64K-row blocks to the MergeTree, just one block at a time.

Verified via `SHOW CREATE TABLE` on the live consumer and via err.log: `StorageKafka` no longer appears in the OOM stack traces.

Doesn't (yet) fix the deeper MV-halt symptom: the consumer still hits intermittent ProtobufList BAD_ARGUMENTS errors when the proto file is briefly unavailable during the docker entrypoint's chown of /var/lib/clickhouse/format_schemas/. Tracking as a follow-up: the schema-race needs either a startup barrier or kafka_skip_broken_messages turned up so individual schema failures don't halt the consumer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
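The row counts in this commit message are simple products, and a minimal Go sketch makes the bound explicit. The function name `rowsPerPoll` is illustrative only and is not part of xtcp2 or ClickHouse; the constants are the figures quoted above.

```go
package main

import "fmt"

// rowsPerPoll returns how many rows one poll cycle may materialize:
// messages polled times the rows each Envelope message expands into.
func rowsPerPoll(messagesPerPoll, rowsPerEnvelope int) int {
	return messagesPerPoll * rowsPerEnvelope
}

func main() {
	const defaultBatch = 65505 // default quoted in the commit message
	const cappedBatch = 256    // cap introduced by this commit

	// Envelope sizes span roughly 100 to 1000 rows per the commit message.
	for _, rows := range []int{100, 1000} {
		fmt.Printf("rows/envelope=%d default=%d capped=%d\n",
			rows, rowsPerPoll(defaultBatch, rows), rowsPerPoll(cappedBatch, rows))
	}
}
```

With the default the product lands at 6.5M to 65M rows; with the 256-message cap it stays between 25,600 and 256,000 rows.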

…_size

Two related changes to reduce the OOM pressure and stabilize the kafka_engine MV in the mixed clickhouse-pipeline-parquet flavor:

* In `clickpipe-up`, after ClickHouse accepts queries, add a ProtobufList schema-warm probe (`SELECT * FROM xtcp.xtcp_flat_records LIMIT 0 FORMAT ProtobufList SETTINGS format_schema=...`). LIMIT 0 produces no rows but ClickHouse still constructs the ProtobufList output format object, which opens the proto file and resolves the message type. Forces the file to be in its final state (post entrypoint chown) before xtcp2 starts producing.
* Lower `kafka_poll_max_batch_size` 256 → 64. With 256 the consumer drained the kafka backlog as fast as it could on first poll, overran the MergeTree's merge throughput, and the resulting parts-merge memory pressure OOM'd the consumer's next allocation. 64 smooths the insert rate enough that merges keep up.

Combined effect at T+5m of a fresh boot:

  config                         ch_rows   OOMs
  chatty-logs only (baseline)    2584      13
  + batch=256 (first attempt)    7448      826 (cascade)
  + batch=64 + schema-warm       4871      11

OOMs are back at the baseline level, about 11 in the first 5 min interval.

Doesn't fully fix the kafka MV halt: the kafka_engine consumer still hits a `BAD_ARGUMENTS: Could not find a message named ...` on its SECOND poll batch (~1 min after producer starts). The schema-warm above proves the schema is loadable for SELECT...FORMAT, but the kafka_engine rebuilds its pipeline each flush_interval (5s) and re-loads the schema independently, and that re-load occasionally fails. Next step (separate fix) is either kafka_skip_broken_messages > 0 so a transient schema-lookup failure isn't terminal, or a longer-living schema cache.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The SELECT...FORMAT ProtobufList probe goes through a different schema loader than kafka_engine — its source-tree importer reports 'CANNOT_PARSE_PROTOBUF_SCHEMA: File not found' for the same file that kafka_engine successfully parses moments later. The probe was failing every boot for the full 30s window, clickpipe-up.service exited with FATAL, and xtcp2 started anyway because `After=` is permissive. So the OOM improvements that landed in 72b2dd2 are entirely from kafka_poll_max_batch_size=64 — keep that. The probe code was dead. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

After capturing the full ClickHouse server log on a fresh boot, the schema errors and the apparent MV halt have a much simpler explanation than what the prior commit messages (8db5dbd, 72b2dd2) claimed:

1. The "Could not find a message named ..." errors in system.kafka_consumers are NOT a ClickHouse-25.3 ProtobufList cache bug, and they're NOT a recurring runtime issue. They come from the official docker entrypoint's pattern of running a temporary 127.0.0.1-only server to execute the initdb scripts, then SIGTERMing it before starting the real server. Our kafka_engine table attaches in that temp server, the consumer thread loads the schema during shutdown, fails BAD_ARGUMENTS, and the failure entry sticks around in system.kafka_consumers.exceptions (capped at 10 entries), but the consumer in the second/real server starts clean and runs fine. You can see two `Application: Starting ClickHouse` events in clickhouse-server.log, ~3 s apart, every boot.
2. The "MV halts at N rows" symptom across the 30-min probe windows wasn't a halt: `Pushing N rows ... took 37152 ms` / 146775 ms entries in the log show individual kafka_engine flushes are taking 30-150 s each under the mixed flavor's ingest rate. ch_rows incrementing by ~2.4 k every 30 min IS the consumer running normally, just slowly. last_poll_time stays current.

The code changes from those commits are still correct: the OOM mitigations (kafka_poll_max_batch_size=64, chatty-logs disable) really do reduce MEMORY_LIMIT_EXCEEDED pressure end-to-end. But the rationale attached to 72b2dd2 about kafka_engine reloading the schema per flush_interval is wrong. Remove the bogus claim from the SQL comment and document the actual root cause in docs/integration-testing.md so the next person investigating doesn't go down the same rabbit hole.

The remaining open question, why each MV flush is so slow (122-column ZSTD MergeTree insert of a few k rows takes tens of seconds), is a real follow-up worth profiling, but it's perf, not correctness.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two coordinated changes that together unlock substantially higher MV throughput in the mixed flavor.

Container memory: 12000m → 14000m

ClickHouse's internal max-memory cap is 90 % of the container limit (max_server_memory_usage_to_ram_ratio default 0.9). At 12000m the cap was 10.55 GiB and CH's baseline MemoryTracking parked at 10.45 GiB constantly. The kafka_engine's per-batch 131 MiB protobuf decode buffer allocation was rejected ~2 %/min; those messages routed through kafka_handle_error_mode='stream' to errors_mv and the consumer lost them. Bumping to 14000m raises the cap to 12.30 GiB.

kafka_poll_max_batch_size: 64 → 16

Bumping the cap alone did NOT help: CH grew to fill the new headroom (MemoryTracking 10.45 → 12.11 GiB) and the same 131 MiB allocation still occasionally hit the new cap. WORSE, with more per-batch memory in flight the per-push processing time during a rejected allocation exceeded max.poll.interval.ms (5 min default), the consumer got kicked from the kafka group, rejoined, and re-read the same batch from the last committed offset → rebalance death loop (offset frozen for the entire hour I left it running).

batch_size=16 keeps the per-poll buffer at ~33 MiB instead of ~131 MiB, and shortens the per-push processing time enough that even under memory pressure the consumer stays inside the poll-interval window. No more rebalance kicks.

Measured at T+31m of a fresh smoke (compared to the prior 8h soak baseline of 12000m / batch=64 over 480 min):

                   8h soak baseline   This config (31m)
  ch_rows          12 237             9 877
  total OOMs       1 383              67
  rows / minute    25                 319 (12.8× faster)
  rows per OOM     8.8                147 (16.8× more efficient)

The OOM rate per minute (~2.2/min) is similar to the baseline, but each OOM costs far fewer rows because the consumer recovers quickly and the in-flight batch is smaller.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
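The 10.55 GiB and 12.30 GiB caps and the ~33 MiB buffer can be reproduced with a few lines of arithmetic. This Go sketch assumes docker's `m` suffix means MiB, that the cap is 0.9 × the container limit (ClickHouse's default max_server_memory_usage_to_ram_ratio), and that the decode buffer scales linearly with kafka_poll_max_batch_size. The helper names are illustrative.

```go
package main

import "fmt"

// ramRatio is ClickHouse's default max_server_memory_usage_to_ram_ratio.
const ramRatio = 0.9

// serverCapGiB converts a container limit in MiB into the implied
// per-server memory cap in GiB.
func serverCapGiB(containerMiB float64) float64 {
	return containerMiB * ramRatio / 1024
}

// scaledBufferMiB assumes the per-poll buffer scales linearly with the
// batch size (an assumption drawn from the commit message, not measured here).
func scaledBufferMiB(bufferMiB float64, oldBatch, newBatch int) float64 {
	return bufferMiB * float64(newBatch) / float64(oldBatch)
}

func main() {
	for _, limit := range []float64{12000, 14000} {
		fmt.Printf("--memory=%.0fm -> cap %.2f GiB\n", limit, serverCapGiB(limit))
	}
	fmt.Printf("batch 64 -> 16: buffer %.2f MiB\n", scaledBufferMiB(131, 64, 16))
}
```

Under these assumptions the caps come out at 10.55 GiB and 12.30 GiB and the buffer at 32.75 MiB, which matches the figures in the commit message.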

Captured from the 20 GB / 28 GB container bumps we tried after the 14 GB / batch=16 validated config. Two non-obvious findings:

1. ClickHouse's MemoryTracking grows to fill whatever per-server cap the container limit implies. The kafka_engine 131 MiB batch alloc keeps tipping the tracker over the cap at the same workload-driven rate (~2.3/min) regardless of how high the cap is set.
2. Past ~20 GB container, per-flush MV insert time grows sharply (8 rows / 37 s at 12 GB → 8 rows / 197 s at 28 GB). That blows past max.poll.interval.ms, the consumer is kicked, and ch_rows freezes in a rebalance death loop: a net REGRESSION.

The proper fix for the residual OOMs is to cap ClickHouse's discretionary caches via config.d so the tracker stops growing into the cap. That's a separate change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two coordinated changes to bound discretionary memory use in the mixed clickhouse-pipeline-parquet flavor.

ClickHouse config.d/limit_memory.xml:

* mark_cache_size 5 GiB → 256 MiB
* index_mark_cache_size 5 GiB → 128 MiB
* uncompressed_cache_size already 0; explicit
* index_uncompressed_cache_size 0
* compiled_expression_cache_size 128 MiB (unchanged)
* leave max_server_memory_usage_to_ram_ratio at default 0.9

Our working set is tiny (~55 MiB of MergeTree data); the 5 GiB default mark cache was hilariously oversized for the workload.

Redpanda: --memory=1G --reserve-memory=0M

Was unbounded under --mode=dev-container. Bounded to 1 GiB; observed RSS now 255 MiB. Frees ~700 MiB of host RAM previously over-reserved.

Measured at T+31m of a fresh smoke vs the 14000m / batch=16 baseline from commit f6f9a86:

                          baseline    this config
  ch_rows / T+31m         9 877       12 167 (+23 %)
  total OOMs              67          68 (no change)
  CH container RSS        9.5 GiB     6.0 GiB (-37 %)
  MemoryTracking (idle)   12.11 GiB   1.29 GiB
  errors_mv rows          67          68

The OOM RATE is unchanged because the OOMs come from peak kafka_engine batch processing (transient 10+ GiB allocation across decode buffer + column buffers + compression buffers), not from the persistent caches. The caches were the steady-state memory consumer; capping them frees the budget for the transient peaks and gives better throughput, but doesn't eliminate the per-batch peak hitting the cap.

A real zero-OOM fix would require reducing the per-batch peak allocation itself (smaller kafka_poll_max_batch_size, fewer columns in the MV, or a custom kafka_engine config). Out of scope here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Drops `kafka_max_block_size` from 65,536 → 1,024 rows and `kafka_flush_interval_ms` from 5000 → 2000 ms.

Diagnosis (credit to dave):

Since the migration to the ProtobufList wire format, each kafka message is already an Envelope containing ~100-1000 XtcpFlatRecord rows. The kafka_engine's own row-level Block accumulator (default 65,505 rows) sits on top of that batching: it accumulates rows from many ProtobufList messages before flushing through the MV. ClickHouse pre-allocates per-column buffers sized for the FULL Block capacity at flush time. With 122 columns × 65K rows worth of pre-allocated buffer + ZSTD/LZ4 compression contexts + MV pipeline state, MemoryTracking parked at ~10 GiB and the 131 MiB chunk allocations occasionally tipped the per-server memory cap. None of that memory was data; our actual workload is ~430 rows/sec ≈ 215 KB/sec on the wire.

Setting block_size to ~1 envelope (1024 rows) makes the kafka_engine effectively pass each ProtobufList through to the MV without redundant accumulation. Per-flush column buffers are 64× smaller.

Measured before/after on a fresh boot of the mixed flavor:

                            block=65536 / flush=5s     block=1024 / flush=2s
  MemoryTracking (idle)     9.31 GiB                   178 MiB (53×)
  MemoryTracking (peak)     10-12 GiB                  246 MiB (40×)
  MEMORY_LIMIT_EXCEEDED     67 / 31 min                0
  errors_mv rows            68                         0
  Throughput                319-393 rows/min           ~27,000 rows/min (~70×)
  Consumer commits / msgs   2 / 426 (rebalance loop)   69 / 69 (1:1)

The throughput now matches xtcp2's actual production rate (~430 rows/sec): the consumer is running in real-time with no backlog.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
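As a quick cross-check of the ratios quoted in this commit, the following Go sketch recomputes them from the stated inputs. The helper names are hypothetical, and the per-row wire size is only what the ~430 rows/sec and ~215 KB/sec figures imply, not a separate measurement.

```go
package main

import "fmt"

// shrinkFactor is how many times smaller a per-flush buffer becomes when
// the block capacity drops from oldRows to newRows.
func shrinkFactor(oldRows, newRows int) int {
	return oldRows / newRows
}

// rowsPerMinute converts a per-second production rate to per-minute.
func rowsPerMinute(rowsPerSec int) int {
	return rowsPerSec * 60
}

// bytesPerRow derives the average wire size of one row.
func bytesPerRow(bytesPerSec, rowsPerSec int) int {
	return bytesPerSec / rowsPerSec
}

func main() {
	fmt.Println("buffer shrink factor:", shrinkFactor(65536, 1024))
	fmt.Println("producer rows/min:", rowsPerMinute(430))
	fmt.Println("implied bytes/row:", bytesPerRow(215000, 430))
}
```

The producer rate works out to 25,800 rows/min, in line with the ~27,000 rows/min the consumer now sustains.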

Updates the Troubleshooting section to:

* mark the earlier "bumping memory doesn't help" entry as historical
* document the real fix from c52e4e5: kafka_max_block_size = 1024 + kafka_flush_interval_ms = 2000
* explain WHY ProtobufList + the default 65K-row Block was redundant and over-allocated column buffers
* include the before/after measurement table so the next debugger sees what good looks like
* note the regression check (SHOW CREATE TABLE to verify the setting hasn't drifted back to the default)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two small fixes so the in-VM Prometheus is useful for long-soak stability tracking:

1. Add a host:19090 → guest:9090 forward (was previously commented out). Lets a host-side scrape or curl reach the in-VM TSDB directly without TTY hops.
2. In the clickhouse-pipeline-parquet mixed flavor, add the second xtcp2 instance on :9089 as a scrape target. Both instances now show up as separate `instance` labels (xtcp2-primary, xtcp2-parquet) so goroutine / memory / GC trends can be compared side-by-side over a 24h soak.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two small bash helpers for monitoring a running mixed-flavor microvm via its host-forwarded :19090 Prometheus endpoint:

* clickpipe-prom-probe.sh: one-line per-instance snapshot of go_goroutines, go_memstats_heap_inuse_bytes (MiB), go_threads for both xtcp2-primary and xtcp2-parquet. Used inside the soak monitor loop for periodic probes.
* clickpipe-stability-summary.sh: soak-end report. Queries current/max for goroutines, OS threads, heap, RSS over the soak window, plus total GC pause time. Useful for "did anything drift?" judgement after a 4-24h run.

The 4h soak passed with these: 6.3M rows ingested, zero OOMs, goroutine drift bounded at +13-18, heap oscillates normally with GC.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two fixes uncovered by a 24h soak that crashed at T+21h:

1. Redpanda was unbounded. Its `start --memory=1G` flag is a seastar data-plane reservation, not an OS cgroup limit, so the rest of the process can allocate freely. Over 21h it grew until it triggered the system OOM-killer (`folio_prealloc 12.9 GiB`), which then chose the largest victim (clickhouse-serv at 11.9 GiB RSS) and killed it. The fix is a real docker `--memory=2G` cgroup cap on the redpanda container.
2. `CLICKHOUSE_ALWAYS_RUN_INITDB_SCRIPTS=true` made every container restart re-run initdb.d scripts, which DROP and recreate xtcp.xtcp_flat_records, so when CH crashed during the soak, docker's `--restart on-failure` brought it back but with zero rows. Removed; initdb now runs only on first-time volume init (when /var/lib/clickhouse is empty). Verified by docker-killing the live container: it comes back via `docker start` with ch_rows intact (19180 before kill → 24044 after, consumer caught up).

Together these mean an OOM-induced or operator-induced CH restart during a 24h soak doesn't lose data, and redpanda can't trigger that OOM in the first place.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…erts

A 24h soak retry just got stuck after 1 h: consumer in rebalance death loop, ch_rows frozen at ~21 k, OOMs climbing despite the kafka_max_block_size=1024 fix.

Root cause: librdkafka's max.poll.interval.ms is 5 min by default, and our MV flush occasionally takes 30-150 s (memory pressure, parts merge, ZSTD on 122 columns). Once that happens during the startup race window when CH is hot, the consumer gets kicked, rejoins at the last committed offset, re-reads the same batch, fails the same way → indefinite loop.

config.d/kafka_client_tuning.xml extends:

* max.poll.interval.ms 5 min → 15 min (900000 ms)
* session.timeout.ms 45 s → 5 min (300000 ms)
* heartbeat.interval.ms explicit 10 s

15 min covers any plausible MV-flush spike. session.timeout.ms stays well below it.

The earlier 4h soak completed cleanly only because it happened to dodge this trap; the 24h soak attempts hit it more reliably because of longer total time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

A 24h soak v3 attempt got stuck at ~22k rows after 1h. Consumer was no longer in a rebalance death loop (commits succeeding), but MV inserts had gone pathologically slow: Pushing 2.45k rows took 414 seconds.

system.asynchronous_metrics shows the cause:

  jemalloc.retained    18.15 GiB   ← held but unused chunks
  jemalloc.allocated   12.35 GiB
  MemoryResident       9.44 GiB    ← actual physical RAM
  MarkCacheBytes       0 B         ← our caches are capped, fine

ClickHouse's MemoryTracker (12.20 GiB) hits its 12.30 GiB cap because of those retained jemalloc chunks even though actual RSS is just 9.44 GiB. Every new alloc has to wait for the tracker to drop below the cap → slow MV inserts.

MALLOC_CONF=background_thread:true,dirty_decay_ms:1000,muzzy_decay_ms:1000 tells jemalloc to:

* run a background thread that purges unused chunks
* mark dirty pages "muzzy" after 1 s of disuse (default 10 s)
* return muzzy pages to OS after 1 s (default 10 s)

End result: retained chunks return to the OS quickly, MemoryTracker sits well below the cap, MV inserts run at normal speed. This is the standard remedy for long-running ClickHouse instances showing jemalloc.retained bloat.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The 24h v4 soak ran 0-22h cleanly with the MALLOC_CONF jemalloc fix, then collapsed at T+22h because:

* /var/lib/docker on the 8 GiB sparse image was 99 % full at T+22h (CH parts 2.92 GiB + redpanda log + dockerd overhead = 7.3 GiB)
* /var/lib/minio on the default 512 MiB tmpfs was 100 % full: the parquet path writes ~10 MiB/min and accumulated 507 MiB of files over 22 h.
* Throughput collapsed to ~5 % of normal once NOT_ENOUGH_SPACE started firing on every kafka_engine commit.

Fixes:

* microvm.volumes: docker.img 8192 → 16384 MiB
* microvm.volumes: add a dedicated 16384 MiB MinIO image at /var/lib/minio (gated on isClickPipeParquet)
* minio-bucket-bootstrap.nix: new `useTmpfs` flag (default true) so the module skips its tmpfs declaration when the caller is providing a real disk

xtcp2 itself was bulletproof across the full 24h: goroutines drifted only +37-43 over 24h, OS threads +34-38, heap oscillated normally with GC, RSS bounded at 247 MiB peak. The "bulletproof 24h" target is met by the daemon; these changes just keep the supporting infrastructure from filling up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The s3parquet layer's new Go files and the touched microvm nix files weren't formatted to the repo's pinned gofmt/nixfmt; format them so the gofmt and nix-fmt checks pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…leged The s3parquet layer added a fail-early startup capability check (checkCapabilities → x.fatalf) to Init(). NewXTCP / NewNsTestingXTCP both call Init, so any test that constructs an XTCP (pkg/xtcp TestNewXTCP_runsToCompletion, cmd/ns TestRunDaemonDefault_constructs) os.Exit'd the test binary on sandboxes lacking CAP_SYS_ADMIN / CAP_NET_ADMIN — the stack only ran these inside the cap-granting microVM. Indirect the gate through a package var (matching the existing constructorRegistry / netNsCandidateDirs seams) and add SetCapabilityCheck; TestMain in each package installs a no-op. The capability logic itself is still exercised directly, with the real method, in init_capabilities_test.go. Production behaviour is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
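The seam pattern described in this commit can be sketched in a few lines of Go. This is a simplified, self-contained illustration and not the xtcp2 implementation: only the names `SetCapabilityCheck` and `Init` come from the commit message, while the signatures, the restore function, and the always-failing stand-in check are assumptions. The real gate calls `x.fatalf`; this sketch returns an error so it can run anywhere.

```go
package main

import (
	"errors"
	"fmt"
)

// capabilityCheck is the package-level seam. Production code never
// reassigns it, so the fail-early behaviour stays the default.
var capabilityCheck = realCapabilityCheck

// realCapabilityCheck stands in for the real probe. Here it always fails,
// imitating a sandbox without CAP_SYS_ADMIN / CAP_NET_ADMIN.
func realCapabilityCheck() error {
	return errors.New("missing CAP_SYS_ADMIN / CAP_NET_ADMIN")
}

// SetCapabilityCheck swaps the check and returns a function that
// restores the previous one (the restore return is an assumption).
func SetCapabilityCheck(f func() error) (restore func()) {
	prev := capabilityCheck
	capabilityCheck = f
	return func() { capabilityCheck = prev }
}

// Init goes through the seam instead of calling the check directly.
func Init() error {
	if err := capabilityCheck(); err != nil {
		return fmt.Errorf("init: %w", err)
	}
	return nil
}

func main() {
	fmt.Println("default:        ", Init())
	restore := SetCapabilityCheck(func() error { return nil })
	fmt.Println("no-op installed:", Init())
	restore()
	fmt.Println("restored:       ", Init())
}
```

In a test package, a `TestMain` would install the no-op once before calling `m.Run()`, which keeps the per-test code unchanged.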

…dConfig printFlags and buildConfig dereference the s3parquet/pyroscope mainFlags fields the s3parquet layer added, but both test fixtures were never updated to allocate the four pyroscope pointers — so both tests nil-deref panicked. Allocate them like the real defineFlags does. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
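The failure mode is the usual one for pointer-valued flag structs: a fixture that leaves a pointer nil panics as soon as the code under test dereferences it. A minimal Go sketch follows, with hypothetical field names (the real pyroscope* fields are not shown in this PR text) and only two of the four pointers for brevity.

```go
package main

import "fmt"

// mainFlags mimics a struct of flag pointers; both field names are
// hypothetical stand-ins for the real pyroscope* fields.
type mainFlags struct {
	pyroscopeEnabled *bool
	pyroscopeURL     *string
}

// printFlags dereferences every pointer, so a nil field panics.
func printFlags(f mainFlags) string {
	return fmt.Sprintf("pyroscope=%t url=%q", *f.pyroscopeEnabled, *f.pyroscopeURL)
}

// newTestFlags allocates every pointer, the way a flag-definition
// function would, so tests start from zero values instead of nil.
func newTestFlags() mainFlags {
	return mainFlags{
		pyroscopeEnabled: new(bool),
		pyroscopeURL:     new(string),
	}
}

func main() {
	fmt.Println(printFlags(newTestFlags()))
}
```

Building fixtures through one constructor means a field added later only needs to be allocated in one place.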

…ine) The test polls for the queueFull counter and breaks the instant it ticks, so a passing run finishes in milliseconds — but the 2s safety deadline was tight enough that a loaded full-suite run (esp. under -race) could trip a false 'counter never ticked' failure. Widen the deadline; the happy path is unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

randomizedcoder · 2026-06-15T04:08:05Z

Note for review: golangci-lint findings deferred to #22

go vet / gofmt / nix-fmt / the 4 audits pass, but the full golangci-lint tier is red — and this PR adds to it. Per discussion, these are handled in the (now-expanded) cleanup PR #22, not here, because they're faithful stack content confirmed present at the s3parquet tip:

Pre-existing on main (2) — poller.go:24 contextcheck, xtcp.go:251 gosec G118.
Added by this PR (~20) — the s3parquet layer's nsTest -traffic/-conns commits (noctx exec.Command→CommandContext, net.Listen→ListenConfig; forbidigo runtime.UnlockOSThread; errcheck unchecked Close) + new init_capabilities.go ST1005 (capitalized error string).

None are in the test-seam / fixture code this PR adds. PR #22 drives the whole tree to golangci-lint-comprehensive green (real handling, no new nolint) alongside the xsync typed wrappers + //nolint:errcheck removal.

randomizedcoder and others added 30 commits June 14, 2026 19:52

randomizedcoder and others added 6 commits June 14, 2026 19:53

randomizedcoder merged commit 5da799c into main Jun 15, 2026

randomizedcoder merged 36 commits into main from s3parquet-destination-pr
